Introduction¶

Author: Lukas Reber

This notebook contains all the code for the Deep Learning Mini Challenge 2. In this MC, we implement a version of the Paper Show and Tell: A Neural Image Caption Generator. The implemented model contains of a CNN encoder and a LSTM decoder and is trained to create captions for images. The model is trained on the Flickr8k dataset. This dataset contains 8091 images with 5 captions each.

Due to limited local resources and the rather high computational cost of training the models, the training was done on Google Colab. Therefore this notebook contains certain colab specific code, which might not be working on other environments. The trained models were then saved and loaded in this notebook for evaluation.

Imports and Functions¶

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import ast
import matplotlib.style as style
from PIL import Image
import matplotlib.patches as patches
from tqdm.notebook import tqdm

from sklearn.model_selection import train_test_split
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
import torchtext; torchtext.disable_torchtext_deprecation_warning()
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import vocab, GloVe
from torchtext.vocab import build_vocab_from_iterator
from collections import Counter
from torchmetrics.text.bleu import BLEUScore


#from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from torchmetrics.text import BLEUScore

device = 'cuda' if torch.cuda.is_available() else 'cpu'
device = 'mps' if torch.backends.mps.is_available() else device
device = 'cpu'
print(f'Using device: {device}')

style.use('ggplot')
np.random.seed(0)
torch.manual_seed(0)

# because of memory issues, we need to set the high watermark to 0.0
os.environ['PYTORCH_MPS_HIGH_WATERMARK_RATIO'] = '0.0'

print(f'torch version: {torch.__version__}')
print(f'torchtext version: {torchtext.__version__}')
print(f'torchvision version: {torch.__version__}')
Using device: cpu
torch version: 2.3.0
torchtext version: 0.18.0
torchvision version: 2.3.0
In [ ]:
if 'COLAB_GPU' in os.environ:

  from google.colab import drive
  drive.mount('/content/drive/')
  print('Running on CoLab')
  captions = pd.read_csv('/content/drive/MyDrive/del/data/captions.txt')
  data_path = '/content/drive/MyDrive/del/data'
  image_path = '/content/Images'
  image_path_prepared = '/content/Images_prepared'
else:
  captions = pd.read_csv('data/captions.txt')
  data_path = 'data'
  image_path = 'data/Images'
  image_path_prepared = 'data/Images_prepared'
# Get a list of all files in the directory
all_images = os.listdir(image_path)
In [ ]:
# Only run this cell wenn on google colab
!unzip /content/drive/MyDrive/del/archive.zip
In [ ]:
captions = pd.read_csv('data/captions.txt')
image_path = 'data/Images'
# Get a list of all files in the directory
all_images = os.listdir(image_path)

Data Analysis¶

In [ ]:
# Nr of captions
print(f'Nr of captions: {captions.shape[0]}')
# Nr of images
print(f'Nr of images: {len(os.listdir(image_path))}')
Nr of captions: 40455
Nr of images: 8091

Sample Images¶

Initially we visualize some sample images from the dataset.

In [ ]:
# display sample images with captions
fig, axs = plt.subplots(5, 3, figsize=(20, 30))

for i in axs.flatten():
    img_file = np.random.choice(os.listdir(image_path))
    img = plt.imread(os.path.join(image_path, img_file))
    i.imshow(img)
    caps = '\n'.join(captions[captions['image'] == img_file]['caption'].values)
    i.text(5, 5, caps, fontsize=8, horizontalalignment='left', verticalalignment='top', color='white', bbox=dict(facecolor='black'))
    i.axis('off')
fig.tight_layout()
plt.show()
No description has been provided for this image

We can see from the sample images that the images are of different sizes and have different aspect ratios. This is important to keep in mind when preprocessing the images. Adittionally, we can see that the captions generally focus on the main object of the image. For example, if there is a person in the image, all of the captions will mention the person but with different descriptive features.

Caption Length¶

In [ ]:
captions['lenght'] = captions['caption'].apply(lambda x: len(x.split()))
captions['n_characters'] = captions['caption'].apply(lambda x: len(x))
In [ ]:
max(captions['lenght'])
np.mean(captions['lenght'])
Out[ ]:
11.78259794833766
In [ ]:
captions['lenght'].hist(bins=20)
plt.title('Distribution of caption lenght')
plt.xlabel('Caption lenght')
plt.ylabel('Number of captions')
plt.show()
No description has been provided for this image

The plot shows the distribution of caption lengths in the dataset. We can see that the majority of captions have a length below 20 words.

In [ ]:
captions['n_characters'].hist(bins=20)
plt.title('Distribution of number of characters in caption')
plt.xlabel('Number of characters')
plt.ylabel('Number of captions')
plt.show()
No description has been provided for this image

Word occurences¶

In [ ]:
words = captions['caption'].str.split().sum()
words = pd.Series(words)
word_counts = words.value_counts()

word_counts.head(50).plot(kind='bar', figsize=(20, 10))
plt.title('Top 50 most common words in captions')
plt.xlabel('Word')
plt.ylabel('Number of occurences')
plt.show()
No description has been provided for this image

The plot shows the occurences of the 50 most common words. The captions have not yet been preprocessed, so can see that punctuations are still present and words may occured in both lower and uppercase. Unsurprisingly, the most common words are "a", "in", "the" and "on".

Image Size / Ratio¶

In [ ]:
sample_images = np.random.choice(all_images, 2000)

fig, ax = plt.subplots(1,1)
for image in sample_images:
    img = Image.open(os.path.join(image_path, image))
    width, height = img.size
    position = (500- width/2, 500 - height/2)
    rectangle = patches.Rectangle(position, width, height, edgecolor='blue', facecolor='none', alpha=0.01, linewidth=1)
    ax.add_patch(rectangle)

plt.title('Image sizes of 2000 random images')
plt.xlabel('Width (px)')
plt.ylabel('Height (px)')

ax.set_xlim(0, 1000)
ax.set_ylim(0, 1000)
ax.set_aspect('equal')
plt.show()
No description has been provided for this image

In order to find a good size for the images, we look at 2000 sample images and plot the aspect ratios of these images. We can see that there are two main clusters of aspect ratios. One where the images are 500px in height and around 400px in width and one where the images are 500px in width and around 400px in height. The goal is to get the images into a common size without distorting them or lossing/adding to much information.

In [ ]:
# Max image width and height
max_width = 0
max_height = 0
for image in all_images:
    img = Image.open(os.path.join(image_path, image))
    width, height = img.size
    max_width = max(max_width, width)
    max_height = max(max_height, height)

print(f'Max image width: {max_width}')
print(f'Max image height: {max_height}')
Max image width: 500
Max image height: 500

Data prearation¶

Prepare Images¶

Since not all images are the same size, but this is a requirement for the model, we will add padding to the images to make them all the same size. The maximum width and height from all images is 500 pixels, so we will resize all images to 500x500 pixels. Other options would be to either crop the images, but this would result in loss of information, which we want to avoid, or repeat the border pixels to fill the image, but this would result in a lot of noise in the images.

In [ ]:
max_width = 500
max_height = 500

for image in tqdm(all_images):
    img = Image.open(os.path.join(image_path, image))
    width, height = img.size
    # cacluate padding
    pad_left = (max_width - width) // 2
    pad_right = max_width - width - pad_left
    pad_top = (max_height - height) // 2
    pad_bottom = max_height - height - pad_top
    # create padding
    padded_img = transforms.Pad(padding=(pad_left, pad_top, pad_right, pad_bottom), fill=0, padding_mode='constant')(img)
    # save image
    padded_img.save(f'data/Images_prepared/{image}')
In [ ]:
# check all image sizes are 500x500
for image in all_images:
    assert Image.open(os.path.join(image_path_prepared, image)).size == (500,500)

Preprocess and tokenize captions¶

In order for the model to understand the captions, we need to tokenize them, which means converting the words to integers. But first, a couple of clean up steps are needed:

  • Lowercase all words
  • Remove punctuation
  • Limit the maximum number of words to 20
  • Add padding to captions shorter than 20 words
  • Mark beginning with <bos> and end with <eos>, which will result in a total token lenght of 22
In [ ]:
tokenizer = get_tokenizer('basic_english')
tokenizer(captions.iloc[1]['caption'])
Out[ ]:
['a', 'girl', 'going', 'into', 'a', 'wooden', 'building', '.']
In [ ]:
begin_token = '<bos>'
end_token = '<eos>'
pad_token = '<pad>'
unk_token = '<unk>'

max_caption_lenght = 20

tokenizer = get_tokenizer('basic_english')

def convert_caption(caption):
    
    # tokenize caption
    process_caption = tokenizer(caption)
    # remove punctuations (only keep alphanumeric characters)
    process_caption = [word for word in process_caption if word.isalnum()]
    # shorten caption to max_caption_lenght
    process_caption = process_caption[:max_caption_lenght]
    # add begin and end token
    process_caption = [begin_token] + process_caption + [end_token]
    # add padding
    process_caption += [pad_token] * (max_caption_lenght + 2 - len(process_caption))
    
    return process_caption
    
captions['tokens'] = captions['caption'].apply(convert_caption)
In [ ]:
captions.head(10)
Out[ ]:
image caption lenght n_characters tokens
0 1000268201_693b08cb0e.jpg A child in a pink dress is climbing up a set o... 18 72 [<bos>, a, child, in, a, pink, dress, is, clim...
1 1000268201_693b08cb0e.jpg A girl going into a wooden building . 8 37 [<bos>, a, girl, going, into, a, wooden, build...
2 1000268201_693b08cb0e.jpg A little girl climbing into a wooden playhouse . 9 48 [<bos>, a, little, girl, climbing, into, a, wo...
3 1000268201_693b08cb0e.jpg A little girl climbing the stairs to her playh... 10 52 [<bos>, a, little, girl, climbing, the, stairs...
4 1000268201_693b08cb0e.jpg A little girl in a pink dress going into a woo... 13 57 [<bos>, a, little, girl, in, a, pink, dress, g...
5 1001773457_577c3a7d70.jpg A black dog and a spotted dog are fighting 9 42 [<bos>, a, black, dog, and, a, spotted, dog, a...
6 1001773457_577c3a7d70.jpg A black dog and a tri-colored dog playing with... 15 71 [<bos>, a, black, dog, and, a, dog, playing, w...
7 1001773457_577c3a7d70.jpg A black dog and a white dog with brown spots a... 19 86 [<bos>, a, black, dog, and, a, white, dog, wit...
8 1001773457_577c3a7d70.jpg Two dogs of different breeds looking at each o... 13 64 [<bos>, two, dogs, of, different, breeds, look...
9 1001773457_577c3a7d70.jpg Two dogs on pavement moving toward each other . 9 47 [<bos>, two, dogs, on, pavement, moving, towar...
In [ ]:
captions.to_csv('data/captions_prepared.csv', index=False)

Create Embeddings¶

Our network does not directly use the actual words from the captions in the dataset, instead it utilizes feature vectors from an embedding layer. The purpose of embedding is to convert the high-dimensional space of words into a lower-dimensional vector space, which improves training efficiency. To achieve this, we need to define the vocabulary and create embeddings for the words in this vocabulary.

There are two methods to create these embeddings:

  • Training Custom Embeddings: This involves learning the embeddings from scratch during the training process. While this approach is straightforward to implement, it may result in a slower learning curve and potentially limit the model’s performance since the embeddings also need to be trained. Also it is not garanteed that the embeddings with similar meaning are close to each other in the embedding space.
  • Using Pre-trained Embeddings: This involves using embeddings that have already been trained on large datasets, such as GloVe embeddings. Pre-trained embeddings can accelerate training and enhance model performance by leveraging existing knowledge about word relationships.

For this project, we will train models both with and without pre-trained embeddings and compare their training outcomes.

In [ ]:
# Expand the list of tokens and count the frequency of each word
vocab_count = Counter(captions['tokens'].explode())

begin_token = '<bos>'
end_token = '<eos>'
pad_token = '<pad>'
unk_token = '<unk>'
special_tokens = [begin_token, end_token, pad_token, unk_token]

def yield_tokens(data):
    for tokens in data:
        yield tokens


voc = build_vocab_from_iterator(yield_tokens(captions['tokens']), specials=special_tokens)
voc.set_default_index(voc[unk_token])


glove = GloVe(name='6B', dim=100)
vectors = glove.get_vecs_by_tokens(voc.get_itos())
voc.vectors = vectors
In [ ]:
def get_embeddings(tokens):
    # Get the embeddings for each token
    embeddings = [voc[token] for token in tokens]
    return embeddings

captions['embeddings'] = captions['tokens'].apply(get_embeddings)
captions.to_csv(os.path.join(data_path, 'captions_embeddings.csv'), index=False)
In [ ]:
# test if the embeddings are correct
sample = captions.sample(10)

for i, row in sample.iterrows():
    assert row['tokens'] == [voc.get_itos()[emb] for emb in row['embeddings']]

Data Splitting & Dataloader¶

In order to train the model and later evaluate it on unseen data, we need to split the data into a training and validation set. We will use 80% of the data for training and 20% for validation.

In [ ]:
# get unique files
unique_files = captions['image'].unique()
# split the files into train and test
files_train, files_test = train_test_split(unique_files, test_size=0.2, random_state=0) 

# split the captions into train and test
captions_train = captions[captions['image'].isin(files_train)]
captions_test = captions[captions['image'].isin(files_test)]

print(f'Nr of training captions: {captions_train.shape[0]}')
print(f'Nr of test captions: {captions_test.shape[0]}')
Nr of training captions: 32360
Nr of test captions: 8095

For pytorch to be able to work with the data, we need to create a dataloader. This dataloader will load the images and captions in batches. Additionally furhter transformations can be applied to the images if needed.

In [ ]:
# create a dataset class to read the caption and images
class ImageCaptionDataset(Dataset):
    def __init__(self, dataframe, image_path, transform=None):
        self.dataframe = dataframe
        self.image_path = image_path
        self.transform = transform

    def __len__(self):
        return self.dataframe.shape[0]

    def __getitem__(self, idx):
        row = self.dataframe.iloc[idx]
        image = Image.open(os.path.join(self.image_path, row['image']))

        if self.transform:
            image = self.transform(image)

        else:
            image = transforms.ToTensor()(image)

        return image, torch.tensor(row['embeddings'])
In [ ]:
transform = transforms.Compose([
    transforms.Resize((160, 160)),
    transforms.ToTensor(),
])
train = ImageCaptionDataset(captions_train, image_path=image_path_prepared, transform=None)
test = ImageCaptionDataset(captions_test, image_path=image_path_prepared, transform=None)
In [ ]:
batch_size = 64
shuffle = True

dataloader_train = DataLoader(train, batch_size=batch_size, shuffle=shuffle)
dataloader_test = DataLoader(test, batch_size=batch_size, shuffle=shuffle)
In [ ]:
# return a single batch from the dataloader
images, captions = next(iter(dataloader_train))
print(f'Images shape: {images.shape}')
print(f'Captions shape: {captions.shape}')
Images shape: torch.Size([64, 3, 500, 500])
Captions shape: torch.Size([64, 22])

The shape of the images tensor is [batch_size, num_channels, height, width], and the shape of the captions tensor is [batch_size, sequence_length]

Model definition¶

Our model consists of two parts: an encoder and a decoder.

Encoder

The encoder consists of a pretrained ResNet18 model, which is used to extract image features. We remove the last fully connected layer of the ResNet18 model and replace it with a new fully connected layer that maps the extracted features to a vector of the desired embedding size. This approach leverages the pretrained ResNet18’s ability to capture rich image representations while allowing us to fine-tune the final mapping to better suit our captioning task.

Decoder

The decoder is a Long Short-Term Memory (LSTM) network, suited for sequence prediction tasks. The decoder takes the image feature vector produced by the encoder as its initial hidden state and generates captions word by word. It incorporates an embedding layer to transform word indices into dense vectors, an LSTM layer to model the sequence dependencies, and a fully connected layer to map the LSTM outputs to the vocabulary space.

Combined Model

The encoder and decoder are combined into a single end-to-end trainable model. During training, the forward method:

  1. Passes an image through the encoder to generate image features.
  2. Uses these features as the initial input to the decoder.
  3. The decoder, given the image features and the sequence of previously generated words (during training, this sequence includes the ground truth captions), predicts the next word in the sequence.
In [ ]:
import torch.nn as nn
import torchvision.models as models
import torch.utils.checkpoint as cp


# Encoder class to extract features from input image using a pretrained ResNet model
class Encoder(nn.Module):

    def __init__(self, embed_size):
        super(Encoder, self).__init__()
        # Load pretrained ResNet18 Model
        resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        # Remove the last fully connected layer
        modules = list(resnet.children())[:-1]
        # Create a new sequential model
        self.resnet = nn.Sequential(*modules)
        # Freeze all layers in the encoder
        for param in self.resnet.parameters():
            param.requires_grad = False
        # new fully connected layer with defined embedding size
        self.embed = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):
        # Extract features from the input image
        features = self.resnet(images)
        # resize features to batch size
        features = features.view(features.size(0), -1)
        # pass features through the fully connected layer to get embeddings of defined size
        features = self.embed(features)
        return features

# Decoder class to generate captions from the extracted features
class Decoder(nn.Module):

    def __init__(self, embed_size, hidden_size, vocab_size, num_layers, pt_emb=True):
        super(Decoder, self).__init__()
        # choose if pretrained embeddings should be used
        if pt_emb:
            self.embed = nn.Embedding.from_pretrained(voc.vectors, freeze=True)
        else:
            self.embed = nn.Embedding(vocab_size, embed_size)
        # define lstm layer
        self.lstm = nn.LSTM(input_size=embed_size, hidden_size=hidden_size, num_layers=num_layers, batch_first=True)
        # define linear layer to transform lstm output to vocab size
        self.linear = nn.Linear(hidden_size, vocab_size)
        # define linear layer to reduce the dimension of the embeddings
        self.reduce_dim = nn.Linear(2*embed_size, embed_size)

    def forward(self, features, captions):
        # embedd the input captions
        embeddings = self.embed(captions)
        # expand the image features to match the sequence length of the captions
        features = features.unsqueeze(1).expand(-1, embeddings.size(1), -1)
        # concatenate the image features and the embeddings
        embeddings = torch.cat((features, embeddings), 2)
        # reduce the dimension of the embeddings
        embeddings = self.reduce_dim(embeddings)
        # pass the embeddings through the lstm layer
        hiddens, _ = self.lstm(embeddings)
        # pass the lstm output through the linear layer to get the output (transforming to vocab size)
        outputs = self.linear(hiddens)
        return outputs

# Class to combine the encoder and decoder into a complete model
class ImageCaptionModel(nn.Module):

    def __init__(self, embed_size, hidden_size, vocab_size, num_layers, pt_emb=True):
        super(ImageCaptionModel, self).__init__()
        # Initialize encoder with defined embedding size
        self.encoder = Encoder(embed_size)
        # Initialize decoder with defined embedding size, hidden size, vocab size and number of layers
        self.decoder = Decoder(embed_size, hidden_size, vocab_size, num_layers, pt_emb=pt_emb)

    def forward(self, images, captions):
        # Extract features from the image using the encoder
        features = self.encoder(images)
        # Generate captions from the features using the decoder
        outputs = self.decoder(features, captions)
        return outputs

    # sample a caption from the model
    def sample(self, image, voc, max_len=22, end_token='<eos>'):
        result_caption = []
        with torch.no_grad():
            x = self.encoder(image).unsqueeze(1)
            states = None
            for _ in range(max_len):
                # pass the image features through the lstm layer
                hiddens, states = self.decoder.lstm(x, states)
                # pass the lstm output through the linear layer to get the output (transforming to vocab size)
                output = self.decoder.linear(hiddens.squeeze(1))
                # get the word with the highest probability
                predicted = output.argmax(1)
                # append the word to the result caption
                result_caption.append(predicted.item())
                # embed the predicted word for the next iteration
                x = self.decoder.embed(predicted).unsqueeze(1)
                # stop if the end token is predicted
                if voc.get_itos()[predicted.item()] == end_token:
                    break
        return result_caption

Model Training¶

In [ ]:
# embedding size for encoder output and decoder input
embed_size = 100
# hidden size for LSTM
hidden_size = 512
# vocab size
vocab_size = len(voc)
# number of LSTM layers
num_layers = 1
# learning rate
learning_rate = 0.001

model = ImageCaptionModel(embed_size, hidden_size, vocab_size, num_layers, pt_emb=True).to(device)
/Users/lukasreber/Repositories/del_mc2/del_mc2/lib/python3.10/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/Users/lukasreber/Repositories/del_mc2/del_mc2/lib/python3.10/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet18_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet18_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
In [ ]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
In [ ]:
num_epochs = 200
model_name = 'v04_200_pt_emb_true'

stats = []
for epoch in range(num_epochs):
    for i, (images, captions) in enumerate(dataloader_train):
        images = images.to(device)
        captions = captions.to(device)

        outputs = model(images, captions)
        targets = captions[:, 1:]
        outputs = outputs[:, :-1, :]
        loss = criterion(outputs.reshape(-1, vocab_size), targets.reshape(-1))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if i % 100 == 0:
            print(f'Epoch {epoch}, Batch {i}, Loss: {loss.item()}')
            stats.append([epoch, i, loss.item()])

with open(os.path.join(data_path, 'Models', f'{model_name}.txt'), 'w') as file:
    for sublist in stats:
        file.write(','.join(map(str, sublist)) + '\n')
In [ ]:
# save the model
torch.save(model, os.path.join(data_path, 'Models', f'{model_name}.pt'))

Training Loss¶

We want to minimize the cross entropy loss between the predicted and the actual captions. The loss is calculated for each word in the caption and then averaged over all words. In order to validate if the model is learning, we will plot the training loss over time.

In [ ]:
# read the stats from the file
emb_false = pd.read_csv(os.path.join(data_path, 'Models/v03_150_pt_emb_false.txt'), names=['epoch', 'batch', 'loss'])
emb_true = pd.read_csv(os.path.join(data_path, 'Models/v03_150_pt_emb_true.txt'), names=['epoch', 'batch', 'loss'])

# filter where the batch is 0
emb_false = emb_false[emb_false['batch'] == 0]
emb_true = emb_true[emb_true['batch'] == 0]
In [ ]:
fig, ax = plt.subplots(1, 1, figsize=(10, 5))
ax.plot(emb_false['epoch'], emb_false['loss'], label='Untrained Embeddings')
ax.plot(emb_true['epoch'], emb_true['loss'], label='Pretrained Embeddings')
ax.legend()
ax.set_title('Training Loss')
ax.set_xlabel('Epoch')
ax.set_ylabel('Loss')
plt.show()
No description has been provided for this image

The visualization shows the training loss (cross entropy loss) over the training epochs. The loss decreases over time, which indicates that the model is learning.

Evaluation¶

Load saved model¶

In order to either continue training the model or to evaluate it, we need to load the model. For this to work, the model class needs to be defined in the notebook. Since we used Google Colab to train the models and are now validating them on our local machine, we need to set the device to 'cpu'.

In [ ]:
model = torch.load('data/Models/v03_200_pt_emb_false.pt', map_location=torch.device('cpu'))

Sample predictions¶

In order to assess the quality of the model, we will generate some sample predictions. First of all we will preprocess the images and then feed them to the model to generate captions.

In [ ]:
def preprocess_image(image_path):
    transform = transforms.Compose([
        transforms.ToTensor(),
    ])
    image = Image.open(image_path).convert('RGB')
    image = transform(image).unsqueeze(0).to(device)
    return image

def sample_prediction(image_path, image_name=None, full_output=True):
    if not image_name:
        image_name = np.random.choice(files_test)

    image_path = os.path.join(image_path, image_name)

    # get grouth truth captions
    gt_captions = captions[captions['image'] == image_name]['caption'].values
    gt_captions_join = '\n'.join(gt_captions)

    try:
        image = preprocess_image(image_path)
        caption_indices = model.sample(image, voc)
        pred_caption = [voc.get_itos()[idx] for idx in caption_indices]
        pred_caption_joined = ' '.join(pred_caption).replace('<pad>', '')

        if full_output:
            img = Image.open(image_path)
            plt.imshow(img)
            plt.axis('off')
            plt.title(f'Caption: {pred_caption_joined}')
            plt.text(5, 5, gt_captions_join, fontsize=8, horizontalalignment='left', verticalalignment='top', color='white', bbox=dict(facecolor='black'))
            plt.show()
        else:
            return pred_caption, gt_captions
    except Exception as e:
        print(f'Error: {e}')
In [ ]:
sample_prediction(image_path_prepared)
No description has been provided for this image
In [ ]:
sample_prediction(image_path_prepared,'3387542157_81bfd00072.jpg')
No description has been provided for this image

The captions generated by the model are somewhat useles. The model seems to always predict the word "worn" at the beginning, and the rest of the caption seems to be quite random. It is unclear why exactly this is the case. I would suspect that this does not have to do with the training duration but rather with the architecture of the model. I'm not sure if the model, especially the decoder part, handles the data correctly.

BLEU Score¶

In order to statistically evaluate the models, we will use the BLEU score. The BLEU (Bilingual Evaluation Understudy) score is a metric that compares the generated captions to the ground truth captions. The BLEU score ranges from 0 to 1, where 1 is the best possible score. While the BLEU score is widely used, including in the original paper, that we are trying to replicate, it is not perfect. Since it is based on n-grams, it does not take into account the meaning of the words, which can lead to high BLEU scores for nonsensical captions. Furhtermore the calculation of the score does not evaluate the context of the captions, which means it is unable to detect synonyms or similar words.

We will calculate the BLEU score for the validation set using different n-gram values. N-Gram refers to the number of words that are considered at once. For example, a 2-gram would consider pairs of words, while a 3-gram would consider triplets of words.

In [ ]:
def calc_score(img, metric):
    pred, gt = sample_prediction(image_path_prepared, img, full_output=False)
    # List of tokens to be removed
    tokens_to_remove = ['<eos>', '<pad>']
    for token in tokens_to_remove:
        if token in pred:
            pred.remove(token)

    score = metric(pred, gt).item()

    return pred, gt, score

metric = BLEUScore(n_gram=1)

# Example
calc_score('1000268201_693b08cb0e.jpg', metric)
Out[ ]:
(['bundled', 'in', 'on'],
 array(['A child in a pink dress is climbing up a set of stairs in an entry way .',
        'A girl going into a wooden building .',
        'A little girl climbing into a wooden playhouse .',
        'A little girl climbing the stairs to her playhouse .',
        'A little girl in a pink dress going into a wooden cabin .'],
       dtype=object),
 0.0)
In [ ]:
scores = []
for file in files_train:
    pred, gt, score = calc_score(file, metric)
    scores.append(score)
    # if score > 0.1:
    #     print(f'File: {file}, Score: {score}')
In [ ]:
scores_not_null = [score for score in scores if score > 0]
plt.hist(scores_not_null, bins=20)
plt.title('BLEU Score Distribution')
plt.xlabel('BLEU Score')
plt.ylabel('Number of Images')
plt.show()
No description has been provided for this image

Since we already saw in the sample predictions that the model is not working as expected, we can expect the BLEU score to be quite low. In fact the scores are almost all 0 even on 1-gram level. This indicates that the model is not able to generate meaningful captions. The figure displays the distribution of the BLEU scores where the results with 0 are removed. Overall the results average out to almost 0.

Discussion & Conclusion¶

As evident from the sample predictions and the BLEU scores, the model is not working as expected. The captions generated by the model are nonsensical and the BLEU scores are almost 0. As stated earlier, this is likely due to the implementation of the decoder, LSTM part. As discussed during the presentation, one major flaw of the model is that it takes the previous generated word to predict the next word. This leads to the model not beeing able to learn properly since the first word likely already wrong. A better approach would be to use always the words from the captions during training and only use the generated words during inference. This would likely help the model learn better.

Additionally to reach a better prediction earlier in the training, the words which only occur a few times in the dataset could be removed and replaced with the "unknown" token. This would reduce the vocabulary size and make the model learn faster.